P. Jönsson and C. Wohlin, "Benchmarking K-nearest Neighbour Imputation with Homogeneous Likert Data", Empirical Software Engineering

Author

Claes Wohlin

Abstract

Missing data are common in surveys regardless of research field, undermining statistical analyses and biasing results. One solution is to use an imputation method, which recovers missing data by estimating replacement values. Previously, we have evaluated the hot-deck k-Nearest Neighbour (k-NN) method with Likert data in a software engineering context. In this paper, we extend the evaluation by benchmarking the method against four other imputation methods: Random Draw Substitution, Random Imputation, Median Imputation and Mode Imputation. By simulating both non-response and imputation, we obtain comparable performance measures for all methods. We discuss the performance of k-NN in the light of the other methods, but also for different values of k, different proportions of missing data, different neighbour selection strategies and different numbers of data attributes. Our results show that the k-NN method performs well, even when much data are missing, but has strong competition from both Median Imputation and Mode Imputation for our particular data. However, unlike these methods, k-NN has better performance with more data attributes. We suggest that a suitable value of k is approximately the square root of the number of complete cases, and that letting certain incomplete cases qualify as neighbours boosts the imputation ability of the
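As a rough illustration of the idea in the abstract, the sketch below fills missing Likert answers from the k nearest complete cases, defaulting k to roughly the square root of the number of complete cases. It is not the authors' implementation: the function name knn_impute, the use of None for a missing answer, the Euclidean distance over answered items, and the low median as the replacement value are all assumptions made for this example, and only complete cases act as donors here.

```python
import math
from statistics import median_low


def knn_impute(rows, k=None):
    """Return a copy of rows with None entries filled from the k nearest complete cases.

    rows is a list of equal-length lists of Likert answers (ints), None = missing.
    """
    # Only fully answered cases serve as donors in this simplified sketch.
    complete = [r for r in rows if None not in r]
    if not complete:
        raise ValueError("need at least one complete case to act as a donor")
    if k is None:
        # Rule of thumb from the abstract: k is about sqrt(number of complete cases).
        k = max(1, round(math.sqrt(len(complete))))
    k = min(k, len(complete))

    imputed = []
    for row in rows:
        if None not in row:
            imputed.append(list(row))
            continue
        observed = [i for i, v in enumerate(row) if v is not None]

        # Assumption: Euclidean distance over the items this case did answer.
        def distance(donor):
            return math.sqrt(sum((row[i] - donor[i]) ** 2 for i in observed))

        neighbours = sorted(complete, key=distance)[:k]
        # Assumption: the low median keeps the replacement a valid Likert value.
        filled = [
            v if v is not None else median_low([d[i] for d in neighbours])
            for i, v in enumerate(row)
        ]
        imputed.append(filled)
    return imputed
```

With four complete cases the default is k = 2, so a case such as [1, 2, None] would take its missing answer from the two complete cases closest to it on the first two items. Ties in distance are broken by input order, which a real study would likely want to handle more carefully.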


Similar resources

A Study of K-Nearest Neighbour as an Imputation Method

Data quality is a major concern in Machine Learning and other correlated areas such as Knowledge Discovery from Databases (KDD). As most Machine Learning algorithms induce knowledge strictly from data, the quality of the knowledge extracted is largely determined by the quality of the underlying data. One relevant problem in data quality is the presence of missing data. Despite the frequent occu...


An Analysis of Four Missing Data Treatment Methods for Supervised Learning

One relevant problem in data quality is the presence of missing data. Despite the frequent occurrence and the relevance of the missing data problem, many Machine Learning algorithms handle missing data in a rather naive way. However, missing data treatment should be carefully thought out, otherwise bias might be introduced into the knowledge induced. In this work we analyse the use of the k-nearest nei...


A Short Note on Using Multiple Imputation Techniques for Very Small Data Sets

This short note describes a simple experiment to investigate the value of using multiple imputation (MI) methods [2, 3]. We are particularly interested in whether a simple bootstrap based on a k-nearest neighbour (kNN) method can help address the problem of missing values in two very small, but typical, software project data sets. This is an important question because, unfortunately, many real-...


Convergence of random k-nearest-neighbour imputation

Random k-nearest-neighbour (RKNN) imputation is an established algorithm for filling in missing values in data sets. Assume that data are missing in a random way, so that missingness is independent of unobserved values (MAR), and assume there is a minimum positive probability of a response vector being complete. Then RKNN, with k equal to the square root of the sample size, asymptotically produ...


Missing Data Imputation Based on Grey System Theory

This paper proposes a new weighted KNN data-filling algorithm based on grey correlation analysis (GBWKNN), building on nearest-neighbour methods for filling in missing data. It aims to make the filling of missing data insensitive to noisy data by combining grey system theory with the advantages of the K-nearest-neighbour algorithm. The experimental results on six UCI data sets showed that its filli...



Publication date: 2007
